Data driven detection of viruses

ABSTRACT

A virus detection system (VDS) (400) operates under the control of P-code to detect the presence of a virus in a file (100) having multiple entry points. P-code is an intermediate instruction format that uses primitives to perform certain functions related to the file (100). The VDS (400) executes the P-code, which provides Turing-equivalent capability to the VDS. The VDS (400) has a P-code data file (410) for holding the P-code, a virus definition file (VDF) (412) for holding signatures of known viruses, and an engine (414) for controlling the VDS. The engine (414) contains a P-code interpreter (418) for interpreting the P-code, a scanning module (424) for scanning regions of the file (100) for the virus signatures in the VDF (412), and an emulating module (426) for emulating entry points of the file. When executed, the P-code examines the file (100), posts (514) regions that may be infected by a virus for scanning, and posts (518) entry points that may be infected by a virus for emulating. The P-code can also detect (520) certain viruses algorithmically. Then, the posted regions and entry points of the file (100) are scanned (526) and emulated (534) to determine if the file is infected with a virus. This technique allows the VDS (400) to perform sophisticated analysis of files having multiple entry points in a relatively brief amount of time. In addition, the functionality of the VDS (400) can be changed by changing the P-code, reducing the need for burdensome engine updates.

BACKGROUND FIELD OF THE INVENTION

This invention pertains in general to detecting viruses within files in digital computers and more particularly to detecting the presence of a virus in a file having multiple entry points.

BACKGROUND OF THE INVENTION

Simple computer viruses work by copying exact duplicates of themselves to each executable program file they infect. When an infected program executes, the simple virus gains control of the computer and attempts to infect other files. If the virus locates a target executable file for infection, it copies itself byte-for-byte to the target executable file. Because this type of virus replicates an identical copy of itself each time it infects a new file, the simple virus can be easily detected by searching in files for a specific string of bytes (i.e., a “signature”) that has been extracted from the virus.

Encrypted viruses comprise a decryption routine (also known as a decryption loop) and an encrypted viral body. When a program file infected with an encrypted virus executes, the decryption routine gains control of the computer and decrypts the encrypted viral body. The decryption routine then transfers control to the decrypted viral body, which is capable of spreading the virus. The virus is spread by copying the identical decryption routine and the encrypted viral body to the target executable file. Although the viral body is encrypted and thus hidden from view, these viruses can be detected by searching for a signature from the unchanging decryption routine.

Polymorphic encrypted viruses (“polymorphic viruses”) comprise a decryption routine and an encrypted viral body which includes a static viral body and a machine-code generator often referred to as a “mutation engine.” The operation of a polymorphic virus is similar to the operation of an encrypted virus, except that the polymorphic virus generates a new decryption routine each time it infects a file. Many polymorphic viruses use decryption routines that are functionally the same for all infected files, but have different sequences of instructions.

These multifarious mutations allow each decryption routine to have a different signature. Therefore, polymorphic viruses cannot be detected by simply searching for a signature from a decryption routine. Instead, antivirus software uses emulator-based antivirus technology, also known as Generic Decryption (GD) technology, to detect the virus. The GD scanner works by loading the program into a software-based CPU emulator which acts as a simulated virtual computer. The program is allowed to execute freely within this virtual computer. If the program does in fact contain a polymorphic virus, the decryption routine is allowed to decrypt the viral body. The GD scanner can then detect the virus by searching through the virtual memory of the virtual computer for a signature from the decrypted viral body.

Metamorphic viruses are not encrypted but vary the instructions in the viral body with each infection of a host file. Accordingly, metamorphic viruses often cannot be detected with a string search because they do not have static strings.

Regardless of whether the virus is simple, encrypted, polymorphic, or metamorphic, the virus typically infects an executable file by attaching or altering code at or near an “entry point” of the file. An “entry point” is an instruction or instructions in the file that a virus can modify to gain control of the computer system on which the file is being executed. Many executable files have a “main entry point” containing instructions that are always executed when the program is invoked. Accordingly, a virus seizes control of the program by manipulating program instructions at the main entry point to call the virus instead of the program. The virus then infects other files on the computer system.

When infecting a file, the virus typically stores the viral body at the main entry point, at the end of the program file, or at some other convenient location in the file. When the virus completes execution, it calls the original program instructions that were altered by the virus.

In order to detect the presence of a virus, antivirus software typically scans the code near the main entry point, and other places where the viral body is likely to reside, for strings matching signatures held in a viral signature database. In addition, the antivirus software emulates the code near the main entry point in an effort to decrypt any encrypted viral bodies. Since viruses usually infect only the main entry point, the antivirus software can scan and emulate a file relatively quickly. When new viruses are detected, the antivirus software can be updated by adding the new viral signatures to the viral signature database.

More recently, however, viruses have been introduced that infect entry points other than the main entry point. As a result, the number of potential entry points for a viral infection in a typical search space, such as a MICROSOFT WINDOWS portable executable (PE) file, is very large. Prior art antivirus software would require an extremely long processing time to scan and/or emulate the code surrounding all of the entry points in the file that might be infected by a virus.

Moreover, the multiple entry points provide opportunities for viruses to use previously unknown methods to infect a file. As a result, it may not be possible to detect the virus merely by adding a new signature to the viral signature database. In many cases, the virus detection system itself must be updated with hand-coded virus detection routines in order to detect the new viruses. Writing custom detection routines and updating the antivirus software requires a considerable amount of work, especially when the antivirus software is distributed to a mass market.

Therefore, there is a need in the art for antivirus software that can detect viruses in PE and other files having multiple entry points without requiring a prohibitively large amount of processing time. There is also a need that the antivirus software be easily upgradeable, so that new virus detection capabilities can be added without requiring hand-coded virus detection logic or needing to distribute a new virus detection engine.

SUMMARY OF THE INVENTION

The above needs are met by a virus detection system (VDS) (400) for detecting the presence of a virus in a file (100) having multiple entry points. The VDS (400) preferably includes a data file (410) holding P-code instructions. P-code is an interpreted language that provides the VDS (400) with Turing machine-equivalent behavior, and allows the VDS to be updated by merely updating the P-code. The VDS (400) also includes a virus definition file (VDF) (412) containing virus signatures for known viruses. Each virus signature is a string of bytes characteristic of the static viral body of the given virus.

The VDS (400) is controlled by an engine (414) having a P-code interpreter (418) for interpreting the P-code in the data file (410). The P-code interpreter (418) may also contain primitives (420) that can be invoked by the P-code. Primitives are functions that can be called by the P-code. The primitives (420) preferably perform file and memory manipulations, and can also perform other useful tasks. In addition, the engine (414) has a scanning module (424) for scanning a file or range of memory for virus signatures in the VDF (412) and an emulating module (426) for emulating code in the file (100) in order to decrypt polymorphic viruses and detect the presence of metamorphic viruses.

The engine (414) interprets the P-code in the P-code data file (410) and responds accordingly. In one embodiment, the P-code examines the entry points in the file (100) to determine whether the entry points might be infected with a virus. Those entry points and other regions of the file (100) commonly infected by viruses or identified by suspicious characteristics in the file, such as markers left by certain viruses, are posted (514) for scanning. Likewise, the P-code posts (518) entry points and starting contexts for regions of the file (100) that are commonly infected by viruses or bear suspicious characteristics for emulating. Using the P-code to preprocess regions of the file (100) and select only those regions or entry points that are likely to contain a virus for subsequent scanning and/or emulating allows the VDS (400) to examine files for viruses that infect places other than the main entry point in a reasonable amount of time. The P-code can also determine whether the file (100) is infected with a virus by using virus detection routines written directly into the P-code, thereby eliminating the need to scan for strings or emulate the file (100).

A region posted for string scanning is identified by a range of memory addresses. Preferably, the P-code merges postings having overlapping ranges so that a single posting specifies the entire region to be scanned. When an entry point is posted for emulating, the P-code specifies the emulation context, or starting state to be used for the emulation. An entry point can be posted multiple times with different contexts for each emulation.

The engine (414) uses the scanning module (424) to scan the regions of the file (100) that are posted for scanning by the P-code for the virus signatures in the VDF (412). If the scanning module (424) detects a virus, the VDS (400) preferably reports that the file (100) is infected and stops operation.

If the scanning module (424) does not find a virus in the posted regions, a preferred embodiment of the present invention optionally utilizes a hook to call (530) custom virus detection code. The hook allows virus detection engineers to insert a custom program into the VDS (400) and detect viruses that, for reasons of speed and efficiency, are better detected by custom code.

Then, the VDS (400) preferably uses the emulating module (426) to emulate the posted entry points. Preferably, each posted entry point is emulated for enough instructions to allow polymorphic and metamorphic viruses to decrypt or otherwise become apparent. Once emulation is complete, the VDS (400) uses the scanning module (424) to scan pages of the virtual memory (434) that were either modified or emulated through for signatures of polymorphic viruses and uses stochastic information obtained during the emulation, such as instruction usage profiles, to detect metamorphic viruses. If the scanning module (424) or VDS (400) detects a virus, the VDS reports that the file (100) is infected. Otherwise, the VDS (400) reports that it did not detect a virus in the file (100).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level block diagram of a conventional executable file 100 having multiple entry points that can be infected by a virus;

FIG. 2 is a high-level block diagram of a computer system 200 for storing and executing the file 100 and a virus detection system (VDS) 400;

FIG. 3 is a flow chart illustrating steps performed by a typical virus when infecting the file 100;

FIG. 4 is a high-level block diagram of the VDS 400 according to a preferred embodiment of the present invention; and

FIG. 5 is a flow chart illustrating steps performed by the VDS 400 according to a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In order to accomplish the mischief for which they are designed, software viruses must gain control of a computer's central processing unit (CPU). Viruses typically gain this control by attaching themselves to an executable file (the “host file”) and modifying the executable image of the host file at an entry point to pass control of the CPU to the viral code. The virus conceals its presence by passing control back to the host file after it has run by calling the original instructions at the modified entry point.

Viruses use different techniques to infect the host file. For example, a simple virus always inserts the same viral body into the target file. An encrypted virus infects a file by inserting an unchanging decryption routine and an encrypted viral body into the target file. A polymorphic encrypted virus (a “polymorphic virus”) is similar to an encrypted virus, except that a polymorphic virus generates a new decryption routine each time it infects a file. A metamorphic virus is not encrypted, but it reorders the instructions in the viral body into a functionally equivalent, but different, virus each time it infects a file. Simple and encrypted viruses can typically be detected by scanning for strings in the viral body or encryption engine, respectively. Since polymorphic and metamorphic viruses usually do not have static signature strings, polymorphic and metamorphic viruses can typically be detected by emulating the virus until either the static viral body is decrypted or the virus otherwise becomes apparent. While this description refers to simple, encrypted, polymorphic, and metamorphic viruses, it should be understood that the present invention can be used to detect any type of virus, regardless of whether the virus fits into one of the categories described above.

A virus typically infects an executable file by attaching or altering code at or near an entry point of the file. An “entry point” is any instruction or instructions in the file that a virus can modify to gain control of the computer system on which the file is being executed. An entry point is typically identified by an offset from some arbitrary point in the file. Certain entry points are located at the beginning of a file or region and, thus, are always invoked when the file or region is executed. For example, an entry point can be the first instruction executed when a file is executed or a function within the file is called. Other entry points may consist of single instructions deep within a file that can be modified by a virus. For example, the entry point can be a CALL or JMP instruction that is modified to invoke viral code. Once a virus seizes control of the computer system through the entry point, the virus typically infects other files on the system.

FIG. 1 is a high-level block diagram of an executable file 100 having multiple entry points that can be infected by a virus as described above. In the example illustrated by FIG. 1, the executable file is a Win32 portable executable (PE) file intended for use with a MICROSOFT WINDOWS-based operating system (OS), such as WINDOWS 98, WINDOWS NT, and WINDOWS 2000. Typically, the illustrated file 100 is of the type .EXE, indicating that the file is an executable file, or .DLL, indicating that the file is a dynamic link library (DLL). However, the present invention can be used with any file, and is not limited to only the type of file illustrated in FIG. 1. APPLE MACINTOSH files, for example, share many similarities with Win32 files, and the present invention is equally applicable to such files.

The file 100 is divided into sections containing either code or data and aligned along four kilobyte (KB) boundaries. The MS-DOS section 102 contains the MS-DOS header 102 and is marked by the characters “MZ”. This section 102 contains a small executable program 103 designed to display an error message if the executable file is run in an unsupported OS (e.g., MS-DOS). This program 103 is an entry point for the file 100. The MS-DOS section 102 also contains a field 104 holding the relative offset to the start 108 of the PE section 106. This field 104 is another entry point for the file 100.

The PE section 106 is marked by the characters “PE” and holds a data structure 110 containing basic information about the file 100. The data structure 110 holds many data fields describing various aspects of the file 100. One such field is the “checksum” field 111, which is rarely used by the OS.
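As a rough illustration of the layout just described, the following Python sketch follows the offset field 104 from the MS-DOS header to the start of the PE section. The position of the field (byte offset 0x3C) and the four-byte form of the “PE” marker are taken from the published PE file format rather than from this description, and the function names are hypothetical.

```python
import struct

PE_OFFSET_FIELD = 0x3C  # assumption from the PE format: location of field 104


def find_pe_header_offset(image: bytes) -> int:
    """Return the offset of the PE marker, or raise ValueError if absent."""
    if len(image) < 0x40 or image[:2] != b"MZ":
        raise ValueError("missing MZ marker")
    # Field 104: a 32-bit little-endian offset to the start of the PE section.
    (pe_offset,) = struct.unpack_from("<I", image, PE_OFFSET_FIELD)
    if image[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("offset field does not point at a PE marker")
    return pe_offset


def build_sample_image(pe_offset: int = 0x80) -> bytes:
    """Build a tiny synthetic image holding only the two markers and the field."""
    image = bytearray(pe_offset + 4)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, PE_OFFSET_FIELD, pe_offset)
    image[pe_offset:pe_offset + 4] = b"PE\x00\x00"
    return bytes(image)
```

A value in this field that no longer leads to a PE marker is one simple sign that the field may have been altered.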

The next section 112 holds the section table 114. The section table 114 contains information about each section in the file 100, including the section's type, size, and location in the file 100. For example, entries in the section table 114 indicate whether a section holds code or data, and whether the section is readable, writeable, and/or executable. Each entry in the section table 114 describes a section that may have multiple, one, or no entry points.

The text section 116 holds general-purpose code produced by the compiler or assembler. The data section 118 holds global and static variables that are initialized at compile time.

The export section 120 contains an export table 122 that identifies functions exported by the file 100 for use by other programs. An EXE file might not export any functions but DLL files typically export some functions. The export table 122 holds the function names, entry point addresses, and export ordinal values for the exported functions. The entry point addresses typically point to other sections in the file 100. Each exported function listed in the export table 122 is an entry point into the file 100.

The import section 124 has an import table 126 that identifies functions that are imported by the file 100. Each entry in the import table 126 identifies the external DLL and the imported function by name. When code in the text section 116 calls a function in another module, such as an external DLL file, the call instruction transfers control to a JMP instruction also in the text section 116. The JMP instruction, in turn, directs the call to a location within the import table 126. Both the JMP instruction and the entries in the import table 126 represent entry points into the file 100. Additional information about the Win32 file format is found in M. Pietrek, “Peering Inside the PE: A Tour of the Win32 Portable Executable File Format,” Microsoft Systems Journal, March 1994, which is hereby incorporated by reference.

FIG. 2 is a high-level block diagram of a computer system 200 for storing and executing the host file 100 and a virus detection system (VDS) 400. Illustrated are at least one processor 202 coupled to a bus 204. Also coupled to the bus 204 are a memory 206, a storage device 208, a keyboard 210, a graphics adapter 212, a pointing device 214, and a network adapter 216. A display 218 is coupled to the graphics adapter 212.

The at least one processor 202 may be any general-purpose processor such as an INTEL x86, SUN MICROSYSTEMS SPARC, or POWERPC-compatible CPU. The storage device 208 may be any device capable of holding data, like a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 206 holds instructions and data used by the processor 202. The pointing device 214 may be a mouse, track ball, light pen, touch-sensitive display, or other type of pointing device, and is used in combination with the keyboard 210 to input data into the computer system 200. The graphics adapter 212 displays images and other information on the display 218. The network adapter 216 couples the computer system 200 to a local or wide area network.

Preferably, the host file 100 and program modules providing the functionality of the VDS 400 are stored on the storage device 208. The program modules, according to one embodiment, are loaded into the memory 206 and executed by the processor 202. Alternatively, hardware or software modules for providing the functionality of the VDS 400 may be stored elsewhere within the computer system 200.

FIG. 3 is a flow chart illustrating steps performed by a typical virus when infecting the host file 100. The illustrated steps are merely an example of a viral infection and are not representative of any particular virus. Initially, the virus executes 310 on the computer system 200. The virus may execute, for example, when the computer system 200 executes or calls a function in a previously-infected file.

When the host file 100 is opened, the virus appends 312 the viral code to a location within the file. For example, the virus can append the viral body to the slack space at the end of a section or put the viral body within an entirely new section. The virus can be, for example, simple, encrypted, polymorphic, or metamorphic.

The virus also modifies 314 the section table 114 to account for the added viral code. For example, the virus may change the size entry in the section table 114 to account for the added viral code. Likewise, the virus may add entries for new sections added by the virus. If necessary, the virus may mark an infected section as executable and/or place a value in a little-used field, such as the checksum field 111, to discreetly mark the file as infected and prevent the virus from reinfecting the file 100.

In addition, the virus alters 316 an entry point of the file 100 to call the viral code. The virus may accomplish this step by, for example, overwriting the value in the field 104 holding the relative offset to the start 108 of the PE section 106 with the relative offset to virus code stored elsewhere in the file. Alternatively, the virus can modify entries in the export table 122 to point to sections of virus code instead of the exported functions. A virus can also modify the destination of an existing JMP or CALL instruction anywhere in the file 100 to point to the location of viral code elsewhere in the file, effectively turning the modified instruction into a new entry point for the virus.

FIG. 4 is a high-level block diagram of the VDS 400 according to a preferred embodiment of the present invention. The VDS 400 includes a P-code data file 410, a virus definition file (VDF) 412, and an engine 414. The P-code data file 410 holds P-code instructions for examining the host file 100. As used herein, “P-code” refers to program code instructions in an interpreted computer language. The P-code provides a Turing-equivalent programmable system which has all of the power of a program written in a more familiar language, such as C. Preferably, the P-code instructions in the data file 410 are created by writing instructions in any computer language and then compiling the instructions into P-code. Other portable, i.e., cross-platform, languages or instruction representations, such as JAVA, may be used as well.

The VDF 412 preferably holds an entry or virus definition for each known virus. Each virus definition contains information specific to a virus or strain of viruses, including a signature for identifying the virus or strain. An entry in the VDF 412, according to an embodiment of the present invention, is organized as follows:

    [VirusID]
    0x2f41
    [SigStart]
    0x89, 0xb4, 0xb8, 0x02, 0x096, 0x56, DONE
    [SigEnd]

Here, [VirusID] is a data field for a number that identifies the specific virus or virus strain. [SigStart] and [SigEnd] bracket a virus signature, which is a string of bytes characteristic of the virus or strain having Virus ID 0x2f41. The signature, for example, may identify the static encryption engine of an encrypted virus or the static viral body of a polymorphic virus. The virus signatures are used to detect the presence of a virus in a file (or in the virtual memory 434 after emulating), typically by performing a string scan for the bytes in the signature. In one embodiment of the present invention, the VDF 412 holds virus definitions for thousands of viruses.
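To make the entry layout concrete, the following Python sketch reads one entry of the form shown above into a numeric virus ID and a signature byte string. It assumes one field per line and treats DONE as a terminator for the byte list; the actual on-disk VDF format is not specified here, and the function name is hypothetical.

```python
SAMPLE_ENTRY = """
[VirusID]
0x2f41
[SigStart]
0x89, 0xb4, 0xb8, 0x02, 0x096, 0x56, DONE
[SigEnd]
"""


def parse_vdf_entry(text: str) -> tuple[int, bytes]:
    """Parse one entry into (virus_id, signature_bytes)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    virus_id = int(lines[lines.index("[VirusID]") + 1], 16)
    start = lines.index("[SigStart]") + 1
    end = lines.index("[SigEnd]")
    signature = bytearray()
    # The signature may span several lines; flatten it into one token list.
    for token in ",".join(lines[start:end]).split(","):
        token = token.strip()
        if token == "DONE":  # assumed terminator for the byte list
            break
        if token:
            signature.append(int(token, 16))
    return virus_id, bytes(signature)
```

For the sample entry this yields the ID 0x2f41 and a six-byte signature, which can then be handed to a string scanner.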

The engine 414 controls the operation of the VDS 400. The engine 414 preferably contains a P-code interpreter 418 for interpreting the P-code in the P-code data file 410. The interpreted P-code controls the operation of the engine 414. In alternative embodiments where the data file 410 holds instructions in a format other than P-code, the engine 414 is equipped with a module for interpreting or compiling the instructions in the relevant format. For example, if the data file 410 holds JAVA instructions, the engine 414 preferably includes a JAVA Just-in-Time compiler.

The P-code interpreter 418 preferably includes special P-code function calls called “primitives” 420. The primitives 420 can be, for example, written in P-code or a native language, and/or integrated into the interpreter itself. Primitives 420 are essentially functions useful for examining the host file 100 and the virtual memory 434 that can be called by other P-code. For example, the primitives 420 perform functions such as opening files for reading, closing files, zeroing out memory locations, truncating memory locations, locating exports in the file, determining the type of the file, and finding the offset of the start of a function. The functions performed by the primitives 420 can vary depending upon the computer or operating system in which the VDS 400 is being used. For example, different primitives may be utilized in a computer system running the MACINTOSH operating system than in a computer system running a version of the WINDOWS operating system. In an alternative embodiment, some or all of the primitives 420 can be stored in the P-code data file 410 instead of the interpreter 418.

The engine 414 also contains a scanning module 424 for scanning pages of the virtual memory 434 or regions of a file 100 for virus signatures held in the VDF 412. In one embodiment, the scanning module 424 receives a range of memory addresses as parameters. The scanning module scans the memory addresses within the supplied range for signatures held in the VDF 412.
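A range-limited signature scan of this kind can be sketched in a few lines of Python. This is only an illustrative stand-in for the scanning module 424: it uses a naive search per signature, and the function name and the dictionary mapping virus IDs to signatures are assumptions.

```python
def scan_region(data: bytes, start: int, end: int,
                signatures: dict[int, bytes]) -> list[tuple[int, int]]:
    """Return (virus_id, offset) for each signature lying wholly in [start, end)."""
    hits = []
    for virus_id, signature in signatures.items():
        # bytes.find only reports matches fully contained in data[start:end].
        position = data.find(signature, start, end)
        while position != -1:
            hits.append((virus_id, position))
            position = data.find(signature, position + 1, end)
    return hits
```

A production scanner holding thousands of signatures would more plausibly use a multi-pattern matching algorithm so that each posted region is traversed once.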

The engine 414 also contains an emulating module 426 for emulating code in the file 100 starting at an entry point. The emulating module includes a control program 428 for setting up a virtual machine 430 having a virtual processor 432 and an associated virtual memory 434. The virtual machine can emulate a 32-bit MICROSOFT WINDOWS environment, an APPLE MACINTOSH environment, or any other environment for which emulation is desired. The virtual machine 430 uses the virtual processor 432 to execute code in the virtual memory 434 in isolation from the remainder of the computer system 200. Emulation starts with a given context, which specifies the contents of the registers, stacks, etc. in the virtual processor 432. During emulation, every page of virtual memory 434 that is read from, written to, or emulated through is marked. The number of instructions that the virtual machine 430 emulates can be fixed at the beginning of emulation or can be determined adaptively while the emulation occurs.
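The page-marking behavior can be pictured with a small Python sketch. It models only the bookkeeping, not a virtual processor; the 4 KB page size, the class name, and its methods are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


class MarkedMemory:
    """Toy virtual memory that records which pages have been touched."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)
        self.marked_pages: set[int] = set()

    def _mark(self, address: int, length: int) -> None:
        first = address // PAGE_SIZE
        last = (address + max(length, 1) - 1) // PAGE_SIZE
        self.marked_pages.update(range(first, last + 1))

    def read(self, address: int, length: int) -> bytes:
        self._mark(address, length)
        return bytes(self.data[address:address + length])

    def write(self, address: int, payload: bytes) -> None:
        self._mark(address, len(payload))
        self.data[address:address + len(payload)] = payload

    def marked_ranges(self) -> list[tuple[int, int]]:
        """Byte ranges of marked pages, suitable for a later signature scan."""
        return [(p * PAGE_SIZE, (p + 1) * PAGE_SIZE)
                for p in sorted(self.marked_pages)]
```

After emulation, only the ranges returned by `marked_ranges()` would need to be scanned, rather than the whole virtual memory.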

FIG. 5 is a flow chart illustrating steps performed by the VDS 400 according to a preferred embodiment of the present invention. The behavior of the VDS 400 is controlled by the P-code. Since the P-code provides Turing machine-like functionality to the VDS 400, the VDS 400 has an infinite set of possible behaviors. Accordingly, it should be understood that the steps illustrated in FIG. 5 represent only one possible set of VDS 400 behaviors.

Initially, the engine 414 executes 510 the P-code in the P-code data file 410. Next, the P-code determines 512 which areas of the file 100 should be scanned for virus strings because the areas are likely to contain a simple or encrypted virus. Areas of the file 100 that should be scanned are posted 514 for later scanning. Typically, the main entry point of the PE header and the last section of the file 100 are always posted 514 for string scanning because these are the areas most likely to be infected by a virus. Any other region of the file can be posted 514 for scanning if the region seems suspicious. For example, if the destination of a JMP or CALL instruction points to a suspicious location in the file 100, it may be desirable to post the areas of the file surrounding both the instruction and the destination.

For other regions of the file 100, the determination of whether to scan is made based on tell-tale markers set by the viruses, such as unusual locations and lengths of sections, or unusual attribute settings of fields within the sections. For example, if the value of an unused field, such as the checksum field 111, is set or the length of a section is suspiciously long, then the P-code posts 514 a region of the section for scanning. Likewise, if a section that is normally not executable is marked as executable, then the P-code preferably posts 514 a region of the section for scanning.
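The marker tests above might be expressed along the following lines. This Python sketch is an assumption-laden illustration: the `Section` record, the size threshold, and the choice to post the last section when the checksum field is set are all hypothetical, since the text does not say which region a header-level marker maps to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    offset: int
    size: int
    executable: bool           # flag as read from the section table
    normally_executable: bool  # what is expected for this kind of section


def post_scan_regions(checksum: int, sections: list[Section],
                      max_plausible_size: int) -> list[tuple[int, int]]:
    """Return (start, end) byte ranges of sections showing tell-tale markers."""
    posted = []
    for index, section in enumerate(sections):
        is_last = index == len(sections) - 1
        suspicious = (
            section.size > max_plausible_size
            or (section.executable and not section.normally_executable)
            # Assumption: a set checksum field causes the last section to be posted.
            or (checksum != 0 and is_last)
        )
        if suspicious:
            posted.append((section.offset, section.offset + section.size))
    return posted
```

Because such tests live in interpreted code in this design, thresholds and rules of this kind could be changed without changing the engine.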

Next, the P-code determines 516 which entry points should be posted 518 for emulating because the entry points are likely to execute polymorphic or metamorphic viruses. The P-code checks the main entry point 103 for known non-viral code. If such code is not found, then the P-code posts the main entry point 103 for emulating. Entry points in other regions of the file 100 are posted 518 for emulating if the code exhibits evidence of viral infection. For example, an entry point in a region of the file 100 is preferably posted for emulating if the checksum field 111 in the header contains a suspicious value. When an entry point is posted for emulating, an emulation context, or starting state of the computer system 200, is also specified.

The P-code can also identify 520 viruses in the file 100 without emulating or string searching. This identification is performed algorithmically or stochastically using virus definitions written into the P-code. The virus definitions preferably use the primitives 420 in the interpreter 418 to directly test the file 100 for characteristics of known viruses. For example, if the last five bytes of a file or section have a certain signature found in only one virus, or the file size is evenly divisible by 10, characteristics likely to occur only if the file is infected by certain viruses, then the P-code can directly detect the presence of the virus. In addition, the P-code can be enhanced with algorithms and heuristics to detect the behavior of unknown viruses. If a virus is found 522 by the P-code, the VDS 400 can stop 524 searching and report that the file 100 is infected with a virus.
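The two example tests in this paragraph can be sketched directly. The Python below stands in for a virus definition written in P-code; the five-byte marker passed by the caller and the function name are hypothetical, and the divisor of 10 is the one used in the example above.

```python
def detect_algorithmically(data: bytes, tail_marker: bytes,
                           size_divisor: int = 10) -> list[str]:
    """Return the names of the direct tests that the file contents satisfy."""
    findings = []
    # Test 1: the final bytes match a marker left by one particular virus.
    if tail_marker and data.endswith(tail_marker):
        findings.append("tail marker")
    # Test 2: the size is evenly divisible by a value some virus pads files to.
    if data and len(data) % size_divisor == 0:
        findings.append("size divisible")
    return findings
```

Neither test requires scanning for strings across the file or emulating code. In practice a size test alone would presumably be combined with other evidence, since many clean files will happen to satisfy it.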

Scan requests posted by the P-code are preferably merged and minimized to reduce redundant scanning. For instance, a posted request to scan bytes 1000 to 1500, and another posted request to scan bytes 1200 to 3000, are preferably merged into a single request to scan bytes 1000 to 3000. Any merging algorithm known to those skilled in the art can be used to merge the scan requests. Posted emulating requests having identical contexts can also be merged, although such posts occur less frequently than do overlapping scan requests.
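One conventional way to merge such requests is to sort them by starting offset and fold each request into the previous one whenever they overlap. The Python sketch below shows this approach as one possible choice, not as the merging algorithm of any particular embodiment; the function name is hypothetical.

```python
def merge_scan_requests(requests: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping (start, end) scan requests into disjoint ranges."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(requests):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged
```

Applied to the requests in the example, `merge_scan_requests([(1000, 1500), (1200, 3000)])` returns `[(1000, 3000)]`.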

If the P-code does not directly detect 522 a virus, the VDS 400 next preferably performs scans on the posted regions of the file 100. The VDS 400 executes 526 the scanning module 424 to scan the posted regions for the virus signatures of simple and encrypted viruses found in the VDF 412. If a virus is found 528 by the scanning module 424, the VDS 400 stops scanning 524 and reports that the file 100 is infected with a virus.

If neither the P-code nor the scanning module 424 detects the presence of a virus, the VDS 400 preferably utilizes a hook to execute 530 custom virus-detection code. The hook allows virus detection engineers to insert custom virus detection routines written in C, C++, or any other language into the VDS 400. The custom detection routines may be useful to detect unique viruses that are not practical to detect via the P-code and string scanning. For example, it may be desired to use faster native code to detect a certain virus rather than the slower P-code. Alternate embodiments of the present invention may provide hooks to custom code at other locations in the program flow. If a virus is found 532 by the custom code, the VDS 400 can stop searching 524 for a virus and report that the file 100 is infected.

If the P-code, scanning module 424, and custom code fail to detect a virus, the VDS 400 preferably executes the emulating module 426. The emulating module 426 emulates 534 the code at the entry point posted by the P-code in order to decrypt polymorphic viruses and trace through code to locate metamorphic viruses. Once enough instructions have been emulated that any virus should become apparent (i.e., a polymorphic virus has decrypted the static viral body or the code of a metamorphic virus is recognized), the emulating module 426 preferably detects a polymorphic virus by using the scanning module 424 to scan pages of virtual memory 434 that were marked as modified or executed through for any virus signatures. The emulating module 426 preferably detects a metamorphic virus via stochastic information obtained during emulation, such as instruction usage profiles. If a virus is found 536 by the emulating module 426, the VDS 400 reports that the file 100 is infected. Otherwise, the VDS 400 reports 538 that it did not detect a virus in the file 100.
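
The staged flow described above, in which each detector runs only if the earlier ones found nothing, can be sketched in Python as a simple short-circuiting loop. The four stage functions are placeholders with hypothetical names. In particular, `emulate_stage` does not emulate anything and merely marks where the emulating module 426 would run.

```python
from typing import Callable, List, Optional, Tuple

# A stage takes the file contents and returns a virus name or None.
Stage = Tuple[str, Callable[[bytes], Optional[str]]]


def run_stages(data: bytes, stages: List[Stage]) -> str:
    """Run the stages in order and stop at the first detection."""
    for stage_name, detect in stages:
        virus = detect(data)
        if virus is not None:
            return f"infected: {virus} (found by {stage_name})"
    return "no virus detected"


# Placeholder detectors; only the scan stage has any logic here.
def pcode_stage(data: bytes) -> Optional[str]:
    return None


def scan_stage(data: bytes) -> Optional[str]:
    return "Example.Sig" if b"EVIL" in data else None


def custom_stage(data: bytes) -> Optional[str]:
    return None


def emulate_stage(data: bytes) -> Optional[str]:
    return None


STAGES: List[Stage] = [
    ("p-code", pcode_stage),
    ("scan", scan_stage),
    ("custom", custom_stage),
    ("emulate", emulate_stage),
]

print(run_stages(b"..EVIL..", STAGES))  # infected: Example.Sig (found by scan)
print(run_stages(b"clean", STAGES))     # no virus detected
```

Ordering the stages from cheapest to most expensive means the costly emulation step is reached only for files that the earlier stages could not resolve.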

In sum, the VDS 400 according to the present invention uses P-code and primitives 420 to extend the possible behaviors of the VDS. The P-code also allows the VDS 400 to be updated to detect new viruses without costly engine upgrades. In addition, the behavior of the VDS 400 is adapted to examine files having multiple entry points in a reasonable amount of time.

The above description is included to illustrate the operation of the preferred embodiments and is not meant to limit the scope of the invention. The scope of the invention is to be limited only by the following claims. From the above discussion, many variations will be apparent to one skilled in the relevant art that would yet be encompassed by the spirit and scope of the invention.

1. A virus detection system for detecting if a computer file is infected by a virus, the file having a plurality of potential virus entry points, the system comprising: an engine for controlling operation of the virus detection system responsive to instructions stored in an intermediate language, the instructions adapted to examine the plurality of potential virus entry points and post for emulating ones of the plurality of potential virus entry points exhibiting characteristics indicating a possible virus; an emulating module coupled to the engine for emulating the posted entry points of the file in a virtual memory responsive to the engine, wherein the virus may become apparent during the emulation of an entry point of the file infected by the virus; and a scanning module coupled to the engine for scanning regions of the virtual memory for a signature of the virus responsive to the engine and the emulating module, wherein presence of the virus signature in a scanned region indicates that the file is infected by the virus.
 2. The virus detection system of claim 1, further comprising: a custom module coupled to the engine for executing custom virus-detection code responsive to invocation by the engine.
 3. The virus detection system of claim 1, wherein the intermediate language is P-code and the engine comprises: a P-code interpreter for interpreting the P-code and controlling the operation of the virus detection system responsive thereto.
 4. The virus detection system of claim 3, wherein the engine further comprises: primitives for performing operations with respect to the file and the virtual memory responsive to invocations of the primitives by the P-code.
 5. The virus detection system of claim 1, further comprising: a virus definition file coupled to the scanning module for holding virus signatures for use by the scanning module.
 6. The virus detection system of claim 1, wherein the instructions stored in the intermediate language post regions of the file for scanning by the scanning module.
 7. The virus detection system of claim 6, wherein postings identifying overlapping regions are merged into a single posting identifying the regions of the merged postings.
 8. A method for detecting a virus in a computer file, the file having a plurality of potential virus entry points, the method comprising the steps of: executing instructions stored in an intermediate language representation, the instructions performing the steps of: examining regions of the file for possible infection by viruses and posting for scanning any regions exhibiting characteristics indicating a possible virus infection; examining the plurality of potential virus entry points of the file for possible infections by viruses and posting for emulating ones of the plurality of potential virus entry points exhibiting characteristics indicating a possible virus infection; and examining the posted regions of the file to algorithmically determine whether the file is infected with a virus.
 9. The method of claim 8, wherein the instructions further perform the steps of: merging overlapping regions posted for scanning.
 10. The method of claim 8, wherein the instructions further perform the step of: calling a custom executable program to determine when the file is infected with a virus.
 11. The method of claim 8, further comprising the step of: scanning the regions of the file posted for scanning for signatures of known viruses.
 12. The method of claim 8, further comprising the steps of: emulating the posted entry points in a virtual memory to allow the viruses to become apparent; scanning the virtual memory for signatures of the viruses; and examining stochastic information obtained during emulation to detect the presence of the known viruses.
 13. The method of claim 8, wherein the step of examining the plurality of potential virus entry points of the file for possible infections by viruses and posting for emulating ones of the plurality of potential virus entry points exhibiting characteristics indicating a possible virus infection comprises the step of: determining if a main entry point of the file has known non-viral code; wherein the main entry point is posted for emulating responsive to a determination that the main entry point does not have known non-viral code.
 14. A computer program product comprising: a computer usable medium having computer readable code embodied therein for determining if a computer file is infected by a virus, the file having a plurality of potential virus entry points, the computer readable code comprising: an engine for controlling the operation of the computer program product responsive to instructions stored in an intermediate language, the instructions adapted to examine the plurality of potential virus entry points and post for emulating ones of the plurality of potential virus entry points exhibiting characteristics indicating a possible virus infection; an emulating module for emulating the posted entry points of the file in a virtual memory responsive to the engine, wherein the virus may become apparent during emulation of an entry point of the file infected by the virus; and a scanning module for scanning regions of the virtual memory for a signature of the virus responsive to the engine and the emulating module, wherein presence of the virus signature indicates that the file is infected by the virus.
 15. The computer program product of claim 14, further comprising: a custom module for executing custom virus-detection code responsive to invocation by the engine.
 16. The computer program product of claim 14, wherein the intermediate language is P-code and the engine comprises: a P-code interpreter for interpreting the P-code and controlling the operation of the engine responsive thereto.
 17. The computer program product of claim 16, wherein the engine further comprises: primitives for performing operations with respect to the file and the virtual memory responsive to invocations of the primitives by the P-code.
 18. The computer program product of claim 14, further comprising: a virus definition file for holding virus signatures for use by the scanning module.
 19. The computer program product of claim 14, wherein the instructions stored in the intermediate language post regions of the file for scanning by the scanning module.
 20. The computer program product of claim 19, wherein postings identifying overlapping regions are merged into a single posting identifying the regions of the merged postings.